Now to actually make the vowel plots! This document goes into detail about how I decided to make them the way I did and how to implement them in ggplot, but if you just want to see the final results, jump down to here.
However, the IPA symbols aren’t encoded correctly. They render in RStudio, but not when Quarto renders the document to HTML, and not always when ggplot renders the plots. This isn’t what we want:
ɑ, æ, ɔ, ə, ɛ, i, ɪ, o, u, ʊ, ʌ
So, the next step is to enter the Unicode values manually (copied from this Wikipedia page):
vowels <- c(
  "i_lower"   = "\u0069", # i (close front unrounded)
  "i_upper"   = "\u026A", # ɪ (near-close front unrounded)
  "epsilon"   = "\u025B", # ɛ (open-mid front unrounded)
  "ash"       = "\u00E6", # æ (near-open front unrounded)
  "schwa"     = "\u0259", # ə (mid central)
  "horseshoe" = "\u028A", # ʊ (near-close near-back rounded)
  "u"         = "\u0075", # u (close back rounded)
  "o"         = "\u006F", # o (close-mid back rounded)
  "hat"       = "\u028C", # ʌ (open-mid back unrounded)
  "open_o"    = "\u0254", # ɔ (open-mid back rounded)
  "alpha"     = "\u0251"  # ɑ (open back unrounded)
)
These are ordered from front to back, then close to open (Figure 4).
Then match the unicode for the IPA symbol to the words:
formants %<>%
  mutate(
    Vowel = case_when(
      Word %in% c("ball", "father", "honorific", "lot", "palm", "start") ~ vowels["alpha"], # (1)
      Word %in% c("bang", "bath", "hand", "laugh", "trap") ~ vowels["ash"],
      Word %in% c("bought", "cloth", "core", "north", "thought", "wrong") ~ vowels["open_o"],
      Word %in% c("among", "famous", "support") ~ vowels["schwa"],
      Word %in% c("bet", "dress", "guest", "says", "square") ~ vowels["epsilon"],
      Word %in% c("beat", "believe", "fleece", "people") ~ vowels["i_lower"],
      Word %in% c("bit", "finish", "kit", "near", "pin") ~ vowels["i_upper"],
      Word %in% c("force", "goat") ~ vowels["o"],
      Word %in% c("blue", "goose", "through", "who") ~ vowels["u"],
      Word %in% c("could", "cure", "put", "foot") ~ vowels["horseshoe"],
      Word %in% c("another", "but", "fun", "strut") ~ vowels["hat"]
    ) %>%
      factor(levels = vowels, ordered = TRUE) # (2)
  )
str(formants)
1. If the value in the Word column is ball, father, honorific, lot, palm, or start, then assign the alpha value from the vowels list.
2. Convert character to factor, then specify the order of the factors (same as in vowels list above) to make sure it stays consistent.
The Lingthusiasm font is Josefin Sans, which is available from Google Fonts.
I downloaded and installed it to my computer. There are a number of different ways to add new fonts without having to install them separately outside of RStudio, such as font_add_google() from the showtext package. However, that method was causing errors rendering the IPA symbols.
system_fonts() from the systemfonts package shows the list of fonts installed on my computer that R recognizes, and it finds Josefin Sans:
# A tibble: 3 × 2
name value
<chr> <chr>
1 path "C:\\Users\\betha\\AppData\\Local\\Microsoft\\Windows\\Fonts\\JosefinS…
2 name "JosefinSans-Thin"
3 family "Josefin Sans"
However, the fonts loaded by default just include Times New Roman, Arial, and Courier New:
windowsFonts()
$serif
[1] "TT Times New Roman"
$sans
[1] "TT Arial"
$mono
[1] "TT Courier New"
This tells R to load Josefin Sans into the set of available fonts, so text will render in Josefin Sans if family = sans_alt, but stick with the default sans font otherwise (and not break the IPA symbols).
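The registration step described above is not shown, so here is a minimal sketch of what it might look like. It assumes a Windows machine with Josefin Sans already installed, uses `windowsFonts()` and `windowsFont()` from grDevices (which only exist on Windows), and takes the name `sans_alt` from the text. The guard simply skips the registration on other platforms.

```r
# windowsFonts()/windowsFont() are Windows-only, so guard the call.
on_windows <- .Platform$OS.type == "windows"

if (on_windows) {
  # Register Josefin Sans under the family name "sans_alt".
  # The default "sans" mapping is left untouched.
  windowsFonts(sans_alt = windowsFont("Josefin Sans"))

  # Confirm the new family now appears alongside serif, sans, and mono.
  print(names(windowsFonts()))
}
```

After this, a ggplot text element can opt in with `family = "sans_alt"`, while anything that keeps the default family (such as the IPA symbols) is unaffected.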
1. Take the full data set, group it by Speaker, then List, then Vowel, and calculate the means of F1 and F2 for each Speaker x List x Vowel combination.
2. All layers of the plot have F2 on the X axis, F1 on the Y axis, and are labelled by Vowel.
3. Write the vowel symbols (because Label = Vowel) at the location of their means.
4. Make the text box background lingthusiasm green with no outline.
5. Make the text white, size 4.5 (note that this is on a different scale than the rest of the text sizes specified later in theme()), vertically and horizontally centered.
6. Set the size of the text boxes, using snpc (square normalized parent coordinates) to be relative to the size of the plot but always square.
7. No margins inside the text boxes and a slight curve on the corners.
8. Split the plot to have Gretchen’s data in the top panels and Lauren’s data in the bottom panels, and the data from the Lingthusiasm episodes in the left panels and the data from the Wells lexical set recordings in the right panels.
9. Change the default theme to have a white background with no grid lines.
10. Change all the lines (axis lines, axis ticks, outline around panels, outline around panel labels) to be the lingthusiasm navy.
11. Make all the text navy Josefin Sans. Set the base size as 12, but make the text of the speaker panel labels bigger.
12. Set the title, and leave the other axis/legend labels as their default values of “F1”, “F2”, and “Vowel.”
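The steps above can be sketched roughly as follows. This is a simplified illustration, not the original chunk: `formants_toy` holds invented values, the navy hex code is a placeholder for the actual brand color, and the styled text boxes (steps 4 to 7) are replaced with plain `geom_text()` to keep the dependencies to dplyr and ggplot2.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the `formants` data frame (all values are invented).
formants_toy <- data.frame(
  Speaker = rep(c("Gretchen", "Lauren"), each = 8),
  List    = rep(rep(c("Episodes", "Wells"), each = 4), times = 2),
  Vowel   = rep(c("i", "i", "u", "u"), times = 4),
  F1      = c(300, 320, 340, 360, 310, 330, 350, 370,
              290, 310, 330, 350, 305, 325, 345, 365),
  F2      = c(2300, 2400, 900, 1000, 2250, 2350, 950, 1050,
              2200, 2300, 1000, 1100, 2150, 2250, 1050, 1150)
)

# Step 1: mean F1 and F2 for each Speaker x List x Vowel combination.
vowel_means <- formants_toy %>%
  group_by(Speaker, List, Vowel) %>%
  summarise(F1 = mean(F1), F2 = mean(F2), .groups = "drop")

navy <- "#1F2A44" # placeholder, not the actual lingthusiasm navy

# Steps 2-3 and 8-12: vowel symbols at their means, one panel per
# Speaker x List, white background, no grid lines, navy text.
p_means <- ggplot(vowel_means, aes(x = F2, y = F1, label = Vowel)) +
  geom_text(colour = navy, size = 4.5) +
  facet_grid(Speaker ~ List) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid   = element_blank(),
    text         = element_text(colour = navy),
    strip.text.y = element_text(size = 14)
  ) +
  labs(title = "Vowel means")
```

Printing `p_means` draws the four panels with the default (unreversed) axes.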
Figure 2: Vowel means (default axes).
(Sidenote: saving the theme specifications so we don’t have to keep retyping theme.)
However, vowel plots typically have their axes reversed, so that the highest value of (F1, F2) is at the bottom left corner instead of the top right corner. This isn’t standard data visualization procedure, but it has a cool and useful result.
1. Just annotating the lines that changed from the previous chunk.
2. Add this to flip the X axis. Specify breaks because the default values aren’t even.
3. Add this to flip the Y axis. The limits (see how they’re reversed) are specified because the defaults were a bit too narrow, and like this the axis ticks/labels are spaced more evenly.
4. Add this to specify the colors and font sizes etc.
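As a hedged sketch of the changes listed above: the theme is saved once as an object and reused, and the two reversed scales flip the axes. The data, break positions, limits, and the name `vowel_theme` are all illustrative assumptions, not the original values.

```r
library(ggplot2)

# Toy vowel means (invented values).
vowel_means_toy <- data.frame(
  Speaker = rep(c("Gretchen", "Lauren"), each = 4),
  List    = rep(rep(c("Episodes", "Wells"), each = 2), times = 2),
  Vowel   = rep(c("i", "u"), times = 4),
  F1      = c(310, 350, 320, 360, 300, 340, 315, 355),
  F2      = c(2350, 950, 2300, 1000, 2250, 1050, 2200, 1100)
)

navy <- "#1F2A44" # placeholder color

# Saved theme, so the specifications are not retyped for every plot.
vowel_theme <- theme_bw(base_size = 12) +
  theme(panel.grid = element_blank(), text = element_text(colour = navy))

p_reversed <- ggplot(vowel_means_toy, aes(x = F2, y = F1, label = Vowel)) +
  geom_text(colour = navy, size = 4.5) +
  # Flip the X axis and set the breaks explicitly.
  scale_x_reverse(breaks = seq(500, 2500, by = 500)) +
  # Flip the Y axis; on a reversed scale the limits go from high to low.
  scale_y_reverse(limits = c(900, 200)) +
  facet_grid(Speaker ~ List) +
  vowel_theme
```

With both scales reversed, high F2 (front vowels) lands on the left and low F1 (close vowels) lands at the top.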
Figure 3: Vowel means (reversed axes).
Now the layout resembles the IPA vowel chart! Front vowels are on the left, and back vowels are on the right; close vowels are on the top, and open vowels are on the bottom.
Figure 4: IPA Vowel Chart.
3.3 Plot Individual Data Points
Just plotting the means for each vowel loses a lot of information, so let’s take a look at the underlying data.
Now, we’ll distinguish between vowels by color. First, make a legend that will be easier to read than the default by creating a string that prints each vowel in its corresponding color (using the ggtext package to render the HTML formatting).
1. Vowel column is the list of vowels (unicode codes). Color column is the hex codes from the Bold palette in rcartocolor, the color set we’ve been using so far. (Using the first 11 values from the full palette, so the last color isn’t gray.)
2. Encase with HTML code, so that hex code becomes a color argument for the vowel character.
3. Merge into 1 string, with each value separated by a comma + space.
4. Print, wrapping lines on each item.
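Here is a minimal base R sketch of those four steps. To stay dependency-free it uses `hcl.colors()` as a stand-in for the rcartocolor Bold palette, so the hex codes differ from the ones in the actual plots. The names `legend_df` and `vowel_legend` are made up for illustration.

```r
# The 11 vowel symbols, entered as Unicode escapes.
vowels <- c("i", "\u026A", "\u025B", "\u00E6", "\u0259", "\u028A",
            "u", "o", "\u028C", "\u0254", "\u0251")

# Step 1: one row per vowel with its color.
# (Stand-in palette; the source uses the Bold palette from rcartocolor.)
legend_df <- data.frame(
  Vowel = vowels,
  Color = grDevices::hcl.colors(length(vowels), palette = "Dark 3")
)

# Step 2: wrap each vowel in an HTML span that carries its hex color.
legend_df$Html <- paste0(
  "<span style='color:", legend_df$Color, "'>", legend_df$Vowel, "</span>"
)

# Step 3: merge into one string, separated by comma + space.
vowel_legend <- paste(legend_df$Html, collapse = ", ")

# Step 4: print with a line break after each item.
cat(gsub(", ", ",\n", vowel_legend, fixed = TRUE), "\n")
```

The resulting `vowel_legend` string can be passed to `labs(subtitle = ...)` and rendered with `element_markdown()` from ggtext.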
(Note that ggplot will throw a warning like Warning in text_info(label, fontkey, fontfamily, font, fontsize, cache): unable to translate '<U+0251>png215' to native encoding, but it renders correctly, so the warnings are turned off in those code chunks.)
1. Passing the full data set, not the means by Speaker + Vowel + List, to ggplot.
2. Instead of geom_text(), use geom_point(), the layer that draws a scatterplot (with size set to make the points slightly bigger than the default).
3. Use the Bold color palette from the rcartocolor package to color-code the vowels. (There are 11 vowels, but I specify 12 colors here so the grey gets skipped.)
4. Limits need to be slightly bigger than in the plots with vowel means, and the breaks are adjusted so that the Y axis labels don’t overlap with each other between the two panels.
5. element_markdown() from ggtext will render the HTML string. Use default sans serif font because Josefin Sans doesn’t have all of the IPA symbols.
6. Add the color-coded list of vowels as a subtitle.
7. Turn the default legend off.
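A simplified sketch of a plot along these lines, under several assumptions: the data are invented, only two vowels are shown, the hex codes are placeholders rather than values pulled from rcartocolor, and `scale_colour_manual()` stands in for whatever palette call the original uses. ggtext is applied only if it is installed.

```r
library(ggplot2)

# Toy individual measurements (invented values).
formants_toy <- data.frame(
  Speaker = rep(c("Gretchen", "Lauren"), each = 4),
  List    = rep(c("Episodes", "Wells"), times = 4),
  Vowel   = rep(c("i", "i", "u", "u"), times = 2),
  F1      = c(300, 320, 340, 360, 310, 330, 350, 370),
  F2      = c(2300, 2400, 900, 1000, 2250, 2350, 950, 1050)
)

# Placeholder colors, one per vowel.
vowel_colors <- c(i = "#7F3C8D", u = "#11A579")

# Color-coded vowel list, in the same HTML format as the legend string.
legend_html <- paste0(
  "<span style='color:", vowel_colors, "'>", names(vowel_colors), "</span>",
  collapse = ", "
)

p_points <- ggplot(formants_toy, aes(x = F2, y = F1, colour = Vowel)) +
  geom_point(size = 2) +                      # slightly bigger than default
  scale_colour_manual(values = vowel_colors) +
  scale_x_reverse() +
  scale_y_reverse() +
  facet_grid(Speaker ~ List) +
  labs(subtitle = legend_html) +              # color-coded vowels as subtitle
  theme_bw() +
  theme(legend.position = "none")             # default legend off

# Render the HTML in the subtitle when ggtext is available.
if (requireNamespace("ggtext", quietly = TRUE)) {
  p_points <- p_points + theme(plot.subtitle = ggtext::element_markdown())
}
```

Without ggtext the subtitle shows the raw HTML tags, which is why the markdown element matters here.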
Figure 5: Individual data points.
3.4 Plot Word Means
The data for each vowel consists of 3 different words. How different are they? First, let’s look at the Wells Lexical Set.
These next plots use the ggrepel package to make the word labels not overlap with each other or with the scatterplot points.
The panels are stacked vertically, because the word labels take up more space. fig-asp: 1.25 in this code chunk’s header makes it render tall enough.
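For readers who have not used ggrepel, a minimal sketch of the idea follows. The word means are invented and the layout is simplified; the only ggrepel call is `geom_text_repel()`, which nudges labels away from each other and from the points.

```r
library(ggplot2)
library(ggrepel)

# Toy word means (invented values).
word_means_toy <- data.frame(
  Speaker = rep(c("Gretchen", "Lauren"), each = 4),
  Word    = rep(c("fleece", "beat", "goose", "blue"), times = 2),
  Vowel   = rep(c("i", "i", "u", "u"), times = 2),
  F1      = c(300, 310, 340, 350, 305, 315, 345, 355),
  F2      = c(2350, 2340, 950, 960, 2300, 2290, 1000, 1010)
)

p_words <- ggplot(word_means_toy,
                  aes(x = F2, y = F1, colour = Vowel, label = Word)) +
  geom_point() +
  geom_text_repel(show.legend = FALSE) +  # labels repel points and each other
  scale_x_reverse() +
  scale_y_reverse() +
  facet_wrap(~ Speaker, ncol = 1) +       # panels stacked vertically
  theme_bw()
```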
Figure 7: Mean for each word in the Lingthusiasm Episode word list.
One thing that makes this plot a bit hard to interpret is that it’s not immediately clear which vowel in the word is the one being plotted. So, let’s make the vowel bold relative to the rest of the word.
So far we’ve been using ggtext to format text, but that doesn’t work with ggrepel. The workaround, like with the IPA vowels, is to just enter the unicode characters directly.
These are the codes for the Mathematical Sans Serif capital letters, in regular and bold faces. They’re copy-pasted in here manually even though the pattern is predictable, because procedurally generating strings with the \u prefix is a pain.
1. Function takes word as a string and alphabet_reg as a named list.
2. Split the word into individual letters.
3. Start string for the converted word.
4. For each letter, use the fact that alphabet_reg is named with the regular letters to get the unicode string for the current letter. Concatenate the letter pulled from alphabet_reg to the converted string.
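A function matching that description could look like the sketch below. One deliberate difference from the text: instead of pasting the `\u` escapes in by hand, this version builds `alphabet_reg` arithmetically with `intToUtf8()`, relying on the Mathematical Sans-Serif capitals occupying the contiguous range starting at U+1D5A0. Naming the list with lowercase letters is an assumption.

```r
# Regular-face Mathematical Sans-Serif capitals (U+1D5A0 onward),
# named by the ordinary lowercase letter each one replaces.
alphabet_reg <- as.list(
  setNames(intToUtf8(0x1D5A0 + 0:25, multiple = TRUE), letters)
)

to_unicode_caps <- function(word, alphabet_reg) {
  # Split the word into individual letters.
  letters_in_word <- strsplit(word, "")[[1]]

  # Start the string for the converted word.
  converted <- ""

  # Look up each letter by name and append its Unicode counterpart.
  for (letter in letters_in_word) {
    converted <- paste0(converted, alphabet_reg[[letter]])
  }
  converted
}

kit_label <- to_unicode_caps("kit", alphabet_reg)
```

This sketch only handles lowercase a to z; any other character would fail the lookup.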
1. For each item in the Word column of the formants dataframe, call the function to_unicode_caps() (defined in previous code chunk) on it. Pass alphabet_reg as the second argument to to_unicode_caps().
2. Insert the words_unicode into the formants dataframe as a column called Word_Label, after the Word column.
3. Convert the items in Word_Label from lists containing 1 string to just strings.
1. Mutating the Word_Label column multiple times, because several words have multiple vowels to swap. Swapping one vowel at a time is shorter than swapping one category of word at a time.
2. First modification to Word_Label is all the words where “A” gets bolded.
3. If the value in the Word column is one of these items
4. Then pass the value of the Word_Label column to str_replace(). Replace the “a” from the regular-face set with the “a” from the bold-face set.
5. If the value in the Word column is not any of those words, keep the value of Word_Label the same.
6. Same logic for all the words where “E” gets bolded.
7. Same logic for all the words where “I” gets bolded.
8. Same logic for all the words where “O” gets bolded.
9. Same logic for all the words where “U” gets bolded.
10. There are a couple of exceptions: “believe”, because that’s the word where the second instance of the vowel gets bolded, not the first one. Replace the consecutive “I” and “E” from the regular-face set with the “I” and “E” from the bold-face set.
11. “Goose” is the only word where both O’s need to be bolded.
12. “Fleece” needs the first two, but not the third E bolded.
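The replacement logic above can be illustrated with a small base R sketch. It swaps `str_replace()` for `sub()` with `fixed = TRUE` (both replace only the first match) and works on a plain named vector rather than a data frame column. The helper names `to_label()`, `bold_first()`, and `bold_all()` are hypothetical, and the alphabets are generated from their code point ranges rather than pasted in.

```r
# Regular (U+1D5A0...) and bold (U+1D5D4...) Mathematical Sans-Serif
# capitals, both named by ordinary lowercase letter.
alphabet_reg  <- setNames(intToUtf8(0x1D5A0 + 0:25, multiple = TRUE), letters)
alphabet_bold <- setNames(intToUtf8(0x1D5D4 + 0:25, multiple = TRUE), letters)

# Convert a lowercase word to regular-face capitals.
to_label <- function(word) {
  paste(alphabet_reg[strsplit(word, "")[[1]]], collapse = "")
}

# Bold the FIRST occurrence of a letter sequence such as "a" or "ie".
bold_first <- function(label, chars) {
  key <- strsplit(chars, "")[[1]]
  sub(paste(alphabet_reg[key], collapse = ""),
      paste(alphabet_bold[key], collapse = ""),
      label, fixed = TRUE)
}

# Bold EVERY occurrence (only needed for "goose").
bold_all <- function(label, chars) {
  key <- strsplit(chars, "")[[1]]
  gsub(paste(alphabet_reg[key], collapse = ""),
       paste(alphabet_bold[key], collapse = ""),
       label, fixed = TRUE)
}

words  <- c("trap", "believe", "goose", "fleece")
labels <- vapply(words, to_label, character(1))

labels["trap"]    <- bold_first(labels["trap"], "a")      # ordinary case
labels["believe"] <- bold_first(labels["believe"], "ie")  # consecutive I + E
labels["goose"]   <- bold_all(labels["goose"], "o")       # both O's
labels["fleece"]  <- bold_first(labels["fleece"], "ee")   # first two E's only
```

Matching the pair "ie" or "ee" as a unit is what lets "believe" skip its first E and "fleece" leave its final E in the regular face.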
Figure 8: Mean for each word in the Lingthusiasm Episode word list.
This was one of the points where I (from the northeast US) realized how little experience I have with Australian accents: I wasn’t entirely sure how much of the messiness in Lauren’s back vowel data was because her back vowels are in different locations, and how much was because I picked words where she uses a different vowel than Gretchen and I do.